Abstract



In this study, we propose and analyze in simulations a new, highly flexible method of imple- 
menting synaptic plasticity in a wafer-scale, accelerated neuromorphic hardware system. The 
study focuses on globally modulated STDP, as a special use-case of this method. Flexibility is 
achieved by embedding a general-purpose processor dedicated to plasticity into the wafer. To 
evaluate the suitability of the proposed system, we use a reward modulated STDP rule in a spike 
train learning task. A single layer of neurons is trained to fire at specific points in time with 
only the reward as feedback. This model is simulated to measure its performance, i.e. the in- 
crease in received reward after learning. Using this performance as baseline, we then simulate 
the model with various constraints imposed by the proposed implementation and compare the 
performance. The simulated constraints include discretized synaptic weights, a restricted inter- 
face between analog synapses and embedded processor, and mismatch of analog circuits. We 
find that probabilistic updates can increase the performance of low-resolution weights, a simple 
interface between analog synapses and processor is sufficient for learning, and performance is 
insensitive to mismatch. Further, we consider communication latency between the wafer and the
conventional control computer system that simulates the environment. This latency increases
the delay with which the reward is sent to the embedded processor. Because the analog synapses
operate in continuous time, this delay can cause the updates to deviate from those of the
undelayed case. We find that for highly accelerated systems latency has to be kept to a
minimum. This study shows the proposed implementation to be suitable for modeling
reward-modulated STDP learning rules. It is therefore an ideal candidate for implementation in an
upgraded version of the wafer-scale system developed within the BrainScaleS project.



Keywords: neuromorphic hardware, wafer-scale integration, large-scale spiking neural net- 
works, spike-timing dependent plasticity, reinforcement learning, hardware constraints analysis 
1. Introduction



In reinforcement learning, an agent learns to achieve a goal through interaction with an environ- 
ment (Sutton and Barto, 1998). The environment provides a single scalar number, the reward, 
as feedback to the actions performed by the learning agent. The agent tries to maximize the 
reward it receives over time by changing its behavior. In contrast to supervised learning, where 
an instructor supplies the correct actions to take, here the agent has to learn the correct strategy 
itself through trial-and-error. Typically this is done by introducing randomness in the selection 
of actions and taking into account the resulting reward. Recently, several studies have suggested 
extending classical spike-timing dependent plasticity (STDP, Morrison et al., 2008; Caporale and 
Dan, 2008) into reward-modulated STDP to implement reinforcement learning in the context of 
spiking neural networks (Izhikevich, 2007; Farries and Fairhall, 2007; Florian, 2007; Legenstein 
et al., 2008; Fremaux et al., 2010; Potjans et al., 2011). One of the key issues in reinforcement 
learning is solving the so-called temporal credit assignment problem: reward arrives some time 
after the action that caused it. So how does the agent know how to change its behavior? It needs 
to retain some information about recent actions in order to assign proper credit for the rewards 
it receives. To do this, reward modulated STDP generates an eligibility trace for every synapse 
that depends on pre- and postsynaptic firing. This trace, modulated by the reward, determines 
the change of synaptic weight, thereby solving the credit assignment problem. 

Spike-based implementations not only offer an approach to biological models of learning,
they are also suitable for implementation in neuromorphic hardware devices. Existing systems
offer a number of interesting characteristics, such as low power consumption (e.g. Wijekoon and
Dudek, 2008; Livi and Indiveri, 2009; Seo et al., 2011), faster than real-time dynamics (Wijekoon
and Dudek, 2008; Schemmel et al., 2010), and scalability to large networks (Schemmel et al.,
2010; Furber et al., 2012). They are typically built with two goals in mind: to serve as a new kind
of brain-inspired information processing device and to provide a scalable platform for the
experimental exploration of networks. Currently, learning capabilities are limited to variants of
unsupervised STDP (Indiveri et al., 2006; Schemmel et al., 2006; Seo et al., 2011; Ramakrishnan
et al., 2011; Davies et al., 2012).

In this study we analyze the implementability of a reward-modulated STDP model derived 
from Fremaux et al. (2010) in neuromorphic hardware. To that end, we propose an extended ver- 
sion of the BrainScaleS wafer-scale system (Schemmel et al., 2008; Fieres et al., 2008; Schem- 
mel et al., 2010) to serve as a conceptual basis for this analysis. This system is designed as a 
power-efficient, faster than real-time and flexible emulation platform for large neural networks. 
In particular, the acceleration in time compared to biology makes it interesting for reinforcement
learning, which typically suffers from slow convergence (Sutton and Barto, 1998). Starting from 
an existing system with limited modifications leads to a more realistic design prototype compared 
to starting from scratch. 

A key objective for the proposed neuromorphic system is to be a valuable tool for neuro- 
science. Therefore, the design must not be targeted at a single network architecture, task or 
learning rule, but instead stay as flexible as is reasonably possible. On the other hand, imple- 
menting large-scale neural networks with accelerated time-scale raises technical challenges and 
trade-offs have to be made between flexibility and performance. The proposed extension
represents a plasticity mechanism reflecting this design philosophy: specialized analog circuits in
every synapse are combined with a general-purpose embedded plasticity processor (EPP). This
way, the benefits from the worlds of analog and processor-based computing can be combined: 
analog circuits are used for compact, power-efficient and fast local processing, and digital pro- 
cessors allow for programmable plasticity rules. Integrating the processors into the same appli- 
cation specific integrated circuits (ASIC) on the wafer as the neuromorphic substrate allows for 
scalability to wafer size networks and beyond. 

In the following, we will consider only the aforementioned rule studied in Fremaux et al. 
(2010) and analyze effects caused by the adaptation to the hardware system in simulations. We 
want to answer the question whether the hybrid approach combining processor and analog cir- 
cuits is a suitable platform for this particular learning rule. Among the hardware-induced con- 
straints are non-continuous weights, drift of analog circuits and communication latency between 
hardware substrate and the controlling computer system. We want to test and compare the perfor- 
mance of the unconstrained and the constrained plasticity rules in order to find guidelines for the 
hardware implementation, for example the required weight resolution or maximum noise levels. 
Section 2 describes the extended hardware system and the plasticity model. Section 3 presents 
results from simulations showing performance under hardware constraints. Section 4 provides a 
discussion of our results. 

2. Materials and methods 

2.1. Using an embedded processor for plasticity

The key concept of our hardware implementation of synaptic plasticity is to use a programmable 
general-purpose processor in combination with fixed-function analog hardware. Software run- 
ning on the processor can use observables and controls to interface with the neuromorphic sub- 
strate. Thereby, it is possible to flexibly switch between synaptic learning rules or use different 
ones in parallel for different synapses. The alternative to this concept would be to use fixed- 
function hardware instead of a general-purpose processor. This would allow a more efficient 
implementation of one specific rule, at the cost of system versatility. In the following, we give 
background information on a complete neuromorphic system following the concept of processor- 
enabled plasticity. From the system described, we derive hardware constraints that are used in 
the simulations reported in Section 3. 

2.1.1. System overview

Figure 1 gives a schematic overview of the complete hardware system. The experimenter controls 
the system through a control cluster of off-the-shelf computers. The network is provided in a de- 
scription abstracted from the details of the system using the PyNN modelling language (Davison 
et al., 2008). An automated mapping process translates the description into the detailed
configuration that is written to the wafer modules (Ehrlich et al., 2010; Wendt et al., 2008). These
modules are interconnected by a high-speed network to communicate spike events (Scholze et al.,
2011). External stimulation can be applied to the network from the control cluster, using the
high-speed links that are also used for configuration. The wafer itself is subdivided into building
blocks that contain the neuromorphic substrate, i.e. synapses, neurons, parameter storage and
networking resources for spike transmission.

Figure 1: Overview of the system. The user controls the system through a cluster of conventional
computers by sending configuration and spike data to a number of modules that each
carry a wafer. These wafer modules are interconnected with a high-speed network to
exchange spike events. The wafer contains identical building blocks, of which one is
shown in an expanded view. The proposed extension to the BrainScaleS wafer-scale
system in form of the embedded plasticity processor is marked in red. Input/output
access from the processor to other components of the building block is indicated with
triangles.

Our proposed extension adds an embedded plasticity processor (EPP) to every building block 
on the wafer, together with its own memory for instructions and data. It will be equipped with 
three interfaces to the fixed-function hardware: read and write access on the synapses, rate coun- 
ters and event generation for the network and access to the control bus of the building block. 
The latter is also used by external control accesses and thus, a plasticity program running on the 
embedded processor will be able to do everything that could be done from an off-wafer control 
computer as long as it only requires information local to the block. There is no direct commu- 
nication channel between processors envisioned, but software on the control computer could be 
used for data exchange. 



2.1.2. Implementing plasticity 

Our proposed design represents a hybrid system, in which the digital EPP interacts closely with
analog components. Every synapse contains an analog accumulation circuit, similar to the version
used in an earlier design (Schemmel et al., 2007). For each pre-post and post-pre spike-pair,
the time difference $\Delta t$ is measured and weighted exponentially using the amplitude $A_\pm$
and time constant $\tau_\pm$:

$$\delta_\pm = A_\pm \exp\left(-\frac{|\Delta t|}{\tau_\pm}\right) \tag{1}$$

These values are added to two local capacitors $a_+$ and $a_-$, respectively. In the extended version
the EPP will select synapses for readout and use an analog evaluation unit to produce a series
of bits $b_i$ out of $a_+$ and $a_-$. The evaluation function can perform different readout operations
controlled by configuration bits $e^i_{cc}$, $e^i_{ca}$, $e^i_{ac}$ and $e^i_{aa}$ and analog
parameters $a_{tl}$ and $a_{th}$:

$$b_i = \begin{cases} 1 & \text{if } \dfrac{a_{tl} + e^i_{ac} a_+ + e^i_{ca} a_-}{1 + e^i_{ac} + e^i_{ca}} > \dfrac{a_{th} + e^i_{cc} a_+ + e^i_{aa} a_-}{1 + e^i_{cc} + e^i_{aa}} \\ 0 & \text{otherwise} \end{cases} \tag{2}$$

Using $b_0, \ldots, b_{N-1}$, the current weight of the synapse $w$ and possibly further global
parameters $P_0, \ldots, P_{M-1}$ as input, the weight update $\Delta$ is then calculated in software
by the EPP:

$$\Delta = F(b_0, \ldots, b_{N-1}, w, P_0, \ldots, P_{M-1}) \tag{3}$$

Then, the new weight $w' = w + \Delta$ is written to weight storage by the plasticity program. Using
two evaluations $b_0$, $b_1$ with different sets of configuration bits, a simple example for $F$
would be:

$$F(b_0, b_1) = A_0 b_0 + A_1 b_1 \tag{4}$$

with arbitrary constants $A_0$ and $A_1$.
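
The software side of this update can be illustrated with a minimal Python sketch. It only mirrors the structure of the weight update function F and the write-back of the new weight; the function names, the integer constants and the saturation to the range of a 4 bit weight are assumptions made for this example and are not taken from the actual plasticity software.

```python
def weight_update(b0, b1, a0=1, a1=-1):
    """Example update function F(b0, b1) = A0*b0 + A1*b1.

    The constants a0 and a1 are illustrative placeholders.
    """
    return a0 * b0 + a1 * b1


def apply_update(w, b0, b1, bits=4):
    """Return w' = w + Delta, saturated to the range of a `bits`-bit weight.

    Saturation at 0 and 2**bits - 1 is an assumption of this sketch.
    """
    w_max = (1 << bits) - 1
    return max(0, min(w_max, w + weight_update(b0, b1)))
```

With these placeholder constants, a synapse whose first evaluation bit is set is strengthened by one step, one whose second bit is set is weakened by one step, and weights at the ends of the range stay where they are.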

Synapses in the system are organized in an array of synapse-units, where each synapse has
a 4 bit weight memory implemented with static random-access memory (SRAM) cells. These
offer the ability to combine adjacent units to increase resolution to 8 bit. Of course this has the
negative effect of reducing the total amount of implementable synapses.

Figure 2: Micro-architecture of the embedded plasticity processor. The design is separated into
frontend and backend. The frontend takes four clock cycles to decode instructions
and issue them in-order to the applicable functional unit. The functional units take a
minimum of two cycles. Writing the result back to the register file takes another cycle.
Input/output operations are performed through a bus interface served by the load/store
unit and a specialized interface to the synapse array.

2.1.3. Embedded micro-processor 

Plasticity algorithms will be implemented by software programs executed on the EPP. A large 
class of micro-processors is in use today for various different applications from supercomputers, 
to smartphones and embedded controllers for traffic lights. They all use different computer 
architectures reflecting the specific requirements and constraints of their application. 

There are three important characteristics for a processor: first, the instruction set
architecture (ISA), which defines the coding and semantics of instructions and registers; second,
whether instructions are executed out-of-order; and third, whether the design is super-scalar,
i.e. whether instructions can execute in parallel. The instruction set architecture used here is a
subset of the PowerISA 2.06 specification for 32 bit (PowerISA, 2010). The main reason to use an
existing ISA is the availability of compilers and tools. Code for the EPP can be generated using
the GNU Compiler Collection (Stallman, 2012), using the C programming language.

The micro-architecture of the EPP is shown in Figure 2. The frontend fetches and issues
instructions in program order to the functional units. Due to different latencies, instructions can
retire out of program order to the write back stage. For example, a slow memory access may
be overtaken by a quick add instruction issued after it. Program and data are stored in a 12 KiB
memory. A direct-mapped cache (ICache) is used for instruction access and to avoid the von
Neumann bottleneck (Backus, 1978). Branches can be predicted with a fully associative branch
predictor using 2 bit saturating counters to track branch outcome (Strategy 7 in Smith, 1998).
The functional units include load/store for memory access, a branch facility for control transfers, 
fixed-point arithmetic and logical instructions including a barrel shifter, multiply and divide. 
The SYNAPSE special-function unit implements application specific instructions and registers. It 
allows for accelerated weight computation and synapse access. 
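
To make the branch prediction scheme concrete, the following Python sketch models a predictor built from 2 bit saturating counters. It is a behavioral illustration only: the dictionary stands in for a fully associative table of unlimited size, and the initial counter state and the absence of a replacement policy are assumptions of this example, not properties of the EPP design.

```python
class TwoBitPredictor:
    """Branch predictor with one 2 bit saturating counter per tracked branch."""

    def __init__(self):
        # Maps a branch address to a counter state in 0..3.
        self.counters = {}

    def predict(self, address):
        # States 2 and 3 predict "taken". Unseen branches start in state 1
        # (weakly not taken), which is an assumption of this sketch.
        return self.counters.get(address, 1) >= 2

    def update(self, address, taken):
        # Move one step towards 3 on a taken branch and towards 0 otherwise,
        # saturating at both ends.
        state = self.counters.get(address, 1)
        state = min(3, state + 1) if taken else max(0, state - 1)
        self.counters[address] = state
```

Because the counter has to be wrong twice in a row before a strongly biased prediction flips, a single atypical outcome, such as the exit of a loop, does not disturb the prediction for the next pass through the same branch.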

An important goal for our proposed design is to maintain small area requirements to allow 
integration into the existing BrainScaleS wafer-scale system. To this end, we chose in-order 
issue of instructions to avoid additional control logic associated with tracking of instructions 
and reordering. However, out-of-order completion can be achieved with relatively small area 
overhead using a result shift-register (Smith and Pleszkun, 1985) and was therefore included to 
improve performance. 

2.2. Model for reinforcement learning 

To demonstrate reinforcement learning using the proposed system architecture, we chose a plas- 
ticity rule and a learning task described in Fremaux et al. (2010). The R-STDP rule (Izhikevich, 
2007; Florian, 2007) is a three-factor synaptic plasticity learning rule that modulates classical 
two-factor STDP with a reward-based success signal S. At the end of each trial of the learning 
task, a reward R is calculated according to the performance of the network and is used to modify 
the weights according to the learning rule. 

2.2.1. Network model 

The network we simulate consists of two layers, connected with plastic synapses using the 
reward-modulated learning rule. The input layer consists of units repeating a given set of spike 
trains. The output layer consists of spiking neurons, being excited by the fixed activity from the 
input layer. 

The original network in Fremaux et al. (2010) uses the simplified Spike Response Model
(SRM$_0$, Gerstner and Kistler, 2002) for the output neurons. It is an intrinsically stochastic
neuron that emits spikes based on the exponentially weighted distance to the threshold. In hardware
the most commonly used neuron type is the deterministic leaky integrate-and-fire (LIF) model. The
proposed system would use the hardware neuron reported in Millner et al. (2010) that can be
operated as an Adaptive Exponential Integrate-and-Fire (AdEx, Brette and Gerstner, 2005) or
conventional LIF model. Since a certain amount of randomness in the firing behavior is required for
reinforcement learning, we add background noise stimulation in the form of Poisson processes.

A tabular description of the network model can be found in Table 1. $N_U$ input units project
onto $N_T$ neurons that are additionally stimulated by $N_B$ random background sources. All
neurons are connected to all inputs, but each has individual random stimulation from equally sized
and disjoint subsets of the random background. In every trial the same input spike pattern is
presented, but the background noise realization is different.

For each input $i = 0, \ldots, N_U - 1$, the input pattern consists of randomly drawn spike times
$s_{ij} \in U(0, t_{\mathrm{trial}})$ with $j = 0, \ldots, N_{\mathrm{stim}} - 1$, where
$U(0, t_{\mathrm{trial}})$ is the uniform distribution on the interval $[0, t_{\mathrm{trial}}]$.
All simulations use the same input spike times $s_{ij}$, which are generated once to
ensure comparability.

Weights for the random background have a uniform value $w_B$, so that every background spike
causes the neuron to fire. Weights for input synapses are initialized to $w_S$, chosen so that single
input spikes do not cause firing. See Table 2 for the numerical values.
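
A minimal Python sketch of the stimulus generation described in this section is given below. It is meant as an illustration, not as the simulation code used for the study: the function names and the use of a fixed seed to obtain the same input pattern in every simulation are assumptions of this example, and the number of spikes per input has to be supplied by the caller.

```python
import random


def make_input_pattern(n_inputs, n_stim, t_trial, seed=0):
    """Draw n_stim spike times per input uniformly from [0, t_trial].

    A fixed seed (an assumption of this sketch) reproduces the same pattern
    in every simulation.
    """
    rng = random.Random(seed)
    return [sorted(rng.uniform(0.0, t_trial) for _ in range(n_stim))
            for _ in range(n_inputs)]


def poisson_spike_train(rate, t_trial, rng):
    """Spike times of a homogeneous Poisson process with the given rate.

    Inter-spike intervals are exponentially distributed with mean 1/rate.
    """
    times = []
    t = rng.expovariate(rate)
    while t < t_trial:
        times.append(t)
        t += rng.expovariate(rate)
    return times
```

The input pattern would be created once and reused, whereas a new background realization would be drawn from `poisson_spike_train` with a fresh random state in every trial.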



2.2.2. Synaptic plasticity model 

In the reward modulated STDP learning rule, the outcome of standard STDP drives so-called
eligibility trace changes $\Delta e_k$:

$$\Delta e_k = \eta A_\pm \exp\left(-\frac{|\Delta t_k|}{\tau_\pm}\right), \tag{5}$$

with learning rate $\eta$, time-difference between pre- and post-synaptic spike $\Delta t_k$ for the
$k$-th pair, STDP time constant $\tau_+$ for pre-before-post pairings, $\tau_-$ for post-before-pre
pairings, and, in the same fashion, amplitude parameters $A_\pm$. The $\Delta e_k$ are accumulated
on a per-synapse eligibility trace $e$. This trace decays exponentially with time constant $\tau_e$:

$$e(t) = \sum_{k:\, t_k < t} \Delta e_k \exp\left(-\frac{t - t_k}{\tau_e}\right) \tag{6}$$

with $t_k$ being the time of the post-synaptic spike for pre-before-post pairings and of the
pre-synaptic spike otherwise.

To calculate the weight update, a success signal $S$ is used as modulating third factor. It
represents the difference between the reward received $R$ and a running average of the reward
$\bar{R}$:

$$S = R - \bar{R} \tag{7}$$

The reward is given at the end of each trial according to the learning task as defined in the next
section. The running average is calculated as $\bar{R}_{n+1} = \bar{R}_n + (R_n - \bar{R}_n)/5$
for the $n$-th trial. The weight update is then given by

$$\Delta = S \, e(t_{\mathrm{trial}}) \tag{8}$$

with the trial duration $t_{\mathrm{trial}}$.

In Fremaux et al. (2010) different time constants for pre-before-post ($\tau_+ = 20$ ms) and
post-before-pre ($\tau_- = 40$ ms) are used. The amplitudes $A_+$ and $A_-$ are chosen so that both
parts are balanced, i.e. $A_+ \tau_+ = -A_- \tau_-$. Synapses of the BrainScaleS wafer-scale system
are designed for time constants of 20 ms. We do not want to assume that this can be increased by a
factor of two and therefore reduce $\tau_-$ to the same value as $\tau_+$. Consequently we also use
amplitudes of identical magnitude to keep the STDP window $W$ balanced. The plasticity rule
described in this section represents the theoretical ideal model for our comparison that we refer
to as the baseline model. Section 2.2.4 describes how this is mapped to hardware and the resulting
constraints.
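
The baseline rule of this section can be summarized in a short Python sketch. It is a simplified illustration under stated assumptions: spike pairs are passed in as precomputed tuples, the sign convention for the time difference (post-synaptic minus pre-synaptic spike time) and all function and parameter names are choices made for this example, and the nearest-neighbor pairing scheme itself is not implemented here.

```python
import math


def eligibility(pairs, t, eta, a_plus, a_minus, tau_pm, tau_e):
    """Eligibility trace at time t as a decayed sum of STDP contributions.

    `pairs` holds (t_k, dt_k) tuples: t_k is the time at which the pair was
    completed and dt_k = t_post - t_pre (sign convention assumed here).
    """
    e = 0.0
    for t_k, dt_k in pairs:
        if t_k >= t:
            continue  # only pairs completed before t contribute
        amplitude = a_plus if dt_k > 0 else a_minus
        delta_e = eta * amplitude * math.exp(-abs(dt_k) / tau_pm)
        e += delta_e * math.exp(-(t - t_k) / tau_e)
    return e


def end_of_trial(reward, mean_reward, e_final):
    """Return the weight change S * e and the updated running mean reward."""
    success = reward - mean_reward              # success signal S
    delta_w = success * e_final                 # weight update
    new_mean = mean_reward + (reward - mean_reward) / 5.0
    return delta_w, new_mean
```

A trial that earns more reward than the running average reinforces whatever pairings left a positive trace, while a below-average trial pushes the same synapses in the opposite direction.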
Table 1: Description of the network model used for the learning task after Nordlie et al. (2009).
See Table 2 for numerical values of the parameters.

A: Model summary

Populations      Three: input U, random background B, target T
Connectivity     Feed-forward
Neuron model     Leaky integrate-and-fire, fixed voltage threshold, fixed absolute refractory period (voltage clamp)
Synapse model    Exponentially shaped post-synaptic conductances
Plasticity       Three-factor STDP
Input            Fixed-length spike-trains with uniformly distributed firing times

B: Populations

Name    Elements              Population size
U       Stimulus generator    $N_U$
B       Poisson generator     $N_B$
T       LIF neurons           $N_T$

C: Connectivity

Source    Target    Pattern
U         T         All-to-all, initial weights $w_S$
B         T         Non-overlapping 250 → 1, weight $w_B$

D: Neuron and synapse model

Name                      LIF neuron
Type                      Leaky integrate-and-fire, exponential-shaped synaptic conductances
Sub-threshold dynamics    $C_m \frac{dV}{dt} = g_L (E_L - V) + g(t) (E_e - V)$ if $t > t^* + \tau_{\mathrm{ref}}$
                          $V(t) = V_{\mathrm{reset}}$ else
                          $g(t) = w \exp(-t/\tau_{\mathrm{syn}})$
Spiking                   if $V(t-) < V_{\mathrm{th}} \wedge V(t+) \geq V_{\mathrm{th}}$
                          1. set $t^* = t$
                          2. emit spike with time-stamp $t^*$

E: Plasticity

Name                    Three-factor STDP
Spike pairing scheme    Reduced symmetric nearest-neighbor (Morrison et al., 2008)
Weight dynamics         $\Delta = S a(t)$
                        $a(t) = \sum_k \eta A_\pm \exp\left(-\frac{|\Delta t_k|}{\tau_\pm}\right) \exp\left(-\frac{t - t_k}{\tau_e}\right)$
                        $w \in [w_{\mathrm{min}}, w_{\mathrm{max}}]$

F: Input

Type                  Target    Description
Stimulus generator    U         $N_{\mathrm{stim}}$ spikes at random firing times distributed uniformly within the trial duration.
Poisson generators    B         Independent Poisson spike-trains with rate $\nu_B$
Table 2: Numerical values for parameters. For parameter definitions see Table 1 and text.

Parameter                 Value
$N_U$                     250
$N_B$                     $N_T \cdot 250$
$N_T$                     5
$C_m$                     500 pF
$g_L$                     10 nS
$E_L$                     -70 mV
$E_e$                     0 mV
$\tau_{\mathrm{ref}}$     10 ms
$V_{\mathrm{reset}}$      -60 mV
$V_{\mathrm{th}}$         -50 mV
$A_\pm$                   $\pm 32$ pS
$\tau_\pm$                20 ms
$\tau_e$                  0.1 ... 1000 s
$w_{\mathrm{min}}$        0 nS
$w_{\mathrm{max}}$        0.5 nS
$w_B$                     20.0 nS
$w_S$                     0.21 nS
$W$                       0.45 nS
$\nu_B$                   0.008 Hz
$t_{\mathrm{trial}}$      1 s
2.2.3. Learning task



In reinforcement learning, the reward given is determined by the nature of the learning task
considered. In our case, the goal of the network is to reproduce a given target spike train. Hence,
reward should be given in proportion to the similarity of the actual and target outputs, as
measured by some metric. Here, we use a normalized version of the metric $D^{\mathrm{spike}}[q]$ by
Victor and Purpura (1996). $D^{\mathrm{spike}}[q]$ represents the minimal cost of transforming the
output of a trial into the target pattern by adding, deleting and shifting spikes. Adding and
deleting have unit cost, while shifting by $\Delta t$ has a cost of $q \Delta t$. For
$\Delta t > 2/q$, deleting the spike and adding a new one at the correct time is cheaper than
shifting it. Therefore, the parameter $q$ controls the precision of the comparison. The cost
parameter is set to $1/q = 20$ ms for our simulations.

Thus in a trial where neuron $j$ fires with a spike train $X_{\mathrm{out},j}$ and the target was
$X_{\mathrm{target}}$, the contribution of neuron $j$ to the reward is

$$R_j = 1 - \frac{D^{\mathrm{spike}}[q](X_{\mathrm{out},j}, X_{\mathrm{target}})}{N_{\mathrm{out},j} + N_{\mathrm{target}}} \tag{9}$$

where $N_{\mathrm{out},j}$ and $N_{\mathrm{target}}$ are the number of spikes in
$X_{\mathrm{out},j}$ and $X_{\mathrm{target}}$ respectively. Because $D^{\mathrm{spike}}[q]$
is bounded by $[0, N_{\mathrm{out},j} + N_{\mathrm{target}}]$, $R_j$ is limited to $[0, 1]$. The
total reward $R$ used for the weight update is the average of $R_j$ over all $N_T$ neurons.
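
The reward computation can be sketched in Python as follows. The distance is computed with the standard dynamic programming recursion for the Victor-Purpura metric, which assumes both spike trains are sorted in time. The function names, the default value of q expressed in units of 1/s, and the convention of returning a reward of 1 when both trains are empty are assumptions of this example.

```python
def victor_purpura(train_a, train_b, q):
    """Victor-Purpura spike distance with shift cost q per unit time.

    Both spike trains must be sorted in ascending order.
    """
    n, m = len(train_a), len(train_b)
    # d[i][j]: cost of transforming the first i spikes of a into the first j of b
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = float(i)
    for j in range(1, m + 1):
        d[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            shift = q * abs(train_a[i - 1] - train_b[j - 1])
            d[i][j] = min(d[i - 1][j] + 1.0,        # delete a spike
                          d[i][j - 1] + 1.0,        # add a spike
                          d[i - 1][j - 1] + shift)  # shift a spike
    return d[n][m]


def neuron_reward(out_train, target_train, q=1.0 / 0.020):
    """Normalized reward of one neuron, in the range [0, 1].

    Returns 1.0 when both trains are empty (an assumption of this sketch).
    """
    total = len(out_train) + len(target_train)
    if total == 0:
        return 1.0
    return 1.0 - victor_purpura(out_train, target_train, q) / total
```

With 1/q = 20 ms, a spike that misses its target by 10 ms costs 0.5, and any miss of more than 40 ms costs the same as deleting the spike and inserting a new one.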

The target spike train is generated by simulating the neural network with a set of reference
weights $W_{ij}$ for inputs $i = 0, \ldots, N_U - 1$ and neurons $j = 0, \ldots, N_T - 1$. All
simulations use the same set of reference weights to ensure fair comparison:

$$W_{ij} = \begin{cases} W & \text{if } 0 \leq i < \frac{N_U}{2} \\ 0 & \text{if } \frac{N_U}{2} \leq i < N_U \end{cases} \tag{10}$$

with $W = 0.45$ nS. An example of an output spike pattern produced by the network is shown
in Figure 3. A new target spike train is generated at the beginning of every simulation run. Its
firing times can be different even for identical weights and stimulation, because of the random
background stimulation.



2.2.4. Simulated hardware constraints 

The baseline plasticity model described in Eqs 5-8 cannot be reproduced exactly by the proposed
system. This results in two distinct classes of effects: trade-offs introduced on purpose to reduce 
costs, for example in area, and non-ideal behavior of the hardware system. 

In the first category, we analyze the effect of discretized weights and a limited access to analog 
variables by software running on the EPP. For the second category we study leakage in analog 
circuits and timing effects caused by finite processor speed and communication latencies. 



Discrete Weights In the hardware system, synaptic weights are discretized since they 
are stored as digital values in the synapse circuit. The number of bits per synapse is a critical 
design decision when building a neuromorphic hardware system. Having fewer bits saves wafer
area, so that more synapses can be implemented. More bits, on the other hand, allow for a higher 
dynamic range of the synaptic efficacies. The weight resolution also defines the minimum step 
size that can be taken by a learning rule. To analyze the sensitivity of learning performance to 
weight resolution, we modify the baseline model to use discrete weights with different numbers 
of bits. On a learning rule update, we precisely calculate the new weight (64 bit floating point) 
and round it to the nearest representable discrete weight value. The tie-breaking rule is round-to- 
even. 

In the case of non-continuous weights with $r$ bits, all updates with

$$|\Delta| < \frac{1}{2} \, \frac{w_{\mathrm{max}} - w_{\mathrm{min}}}{2^r - 1} \tag{11}$$

are discarded by rounding. Here $w_{\mathrm{min}}$ and $w_{\mathrm{max}}$ are the minimum and
maximum weight values that can be represented and $\Delta$ is the true weight update (see
Equation 8). Fewer bits per synapse means that more updates are discarded, causing the effective
learning rule to increasingly deviate from the baseline learning rule.
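
The loss of small updates can be reproduced with a few lines of Python. The sketch below rounds a weight to the nearest representable level; the function name, the default weight range and the clipping of out-of-range values are assumptions of this example. Python's built-in `round` resolves ties to the even neighbor, which matches the tie-breaking rule described above.

```python
def quantize(w, bits, w_min=0.0, w_max=0.5):
    """Round w to the nearest of 2**bits evenly spaced weight levels."""
    step = (w_max - w_min) / (2 ** bits - 1)
    level = round((w - w_min) / step)            # ties are rounded to even
    level = max(0, min(2 ** bits - 1, level))    # clip to the representable range
    return w_min + level * step
```

For a 4 bit weight the step is one fifteenth of the weight range, so an update of 0.4 steps applied to a weight that sits exactly on a level is rounded away, whereas an update of 0.6 steps moves the weight by a full step.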

A workaround to this problem is to perform discretized updates $\Delta_d$ probabilistically,
depending on the exact weight update $\Delta$ as given by Equation 8. In this way, some of the
updates that would otherwise be lost can be preserved. Using the correct update probabilities
results in the average weight change being identical to that of the baseline model, i.e., without
discretization.

To see this, we note that $\Delta_d$ can only assume values that are multiples of the
discretization step $\delta_r = (w_{\mathrm{max}} - w_{\mathrm{min}}) / (2^r - 1)$, assuming
$w_{\mathrm{min}} = 0$. If the baseline weight change $\Delta$ is between the $(k-1)$-th and
$k$-th step, the discrete update $\Delta_d$ is picked from those with probability $1 - p$
and $p$, respectively. Such a scheme leads to the average update $\langle \Delta_d \rangle$ for a
given $\Delta$ being

$$\langle \Delta_d \rangle = k \delta_r p + (k - 1) \delta_r (1 - p) \tag{12}$$
$$= \delta_r (k - 1) + \delta_r p. \tag{13}$$

By picking $p$ as

$$p = \frac{\Delta - (k - 1) \delta_r}{\delta_r}, \tag{14}$$

it holds that $\langle \Delta_d \rangle = \Delta$.


Thresholded readout        The eligibility trace is implemented using the analog accumulation
in the synapse unit. For every spike pair, Equation 1 is evaluated and the corresponding
eligibility trace change is added as charge on the local storage capacitors $a_+$ and $a_-$
respectively. These values are not directly accessible to the EPP. Instead, using the evaluation
unit described in Section 2.1.2 with threshold $\theta = a_{th} - a_{tl}$, accumulation trace
$a = a_+ - a_-$, configuration bits $e^+ = (1, 1, 0, 0)$ for the evaluation of $b_+$ and
$e^- = (0, 0, 1, 1)$ for $b_-$, the readout computes

$$b_\pm = \begin{cases} 1 & \text{if } \pm (a_+ - a_-) > \theta \\ 0 & \text{otherwise} \end{cases} \tag{15}$$
The weight update with threshold readout $\Delta_t$ is then performed using an update constant $A$:

$$\Delta_t = S A (b_+ - b_-). \tag{16}$$

The parameters $\theta$ and $A$ should be chosen so as to minimize the deviation introduced by
calculating weights according to Equation 16 instead of Equation 8. Ideally, one would like
to satisfy $\langle \Delta_t \rangle = \Delta$. However, detailed analysis of the simulations (not
shown) showed that the eligibility trace distributions for different synapses at different stages
of learning were very different. In that context, choosing parameters $\theta$ and $A$ that
minimize the difference between the baseline change $\Delta$ and the average effective change
$\langle \Delta_t \rangle$ for a particular synapse would not in general have the same effect for
other synapses. Instead, we resort to a heuristic method to fix a global threshold and update
constant, described below, and assess its effectiveness in simulations.

For the simulations presented here, a precursor run over 100 trials without learning was used
to measure the final absolute eligibility value $\langle |a| \rangle$ averaged over all readout
operations. The threshold was then set to $\theta^* = \langle |a| \rangle$ for the actual learning
simulation. In this way, the average (across synapses) final eligibility value encountered during
weight updates is close to the threshold. This represents a trade-off between exceeding the
threshold only seldom, but then causing large, possibly disruptive, weight changes, and exceeding
the threshold often, but only applying small changes.

With $N_p(\theta)$ being the number of readout operations that exceed the threshold, i.e. $b_+$ or
$b_-$ are non-zero, and the total number of readout operations $N$, the update constant $A$ is
set to

$$A^* = \frac{N}{N_p(\theta^*)} \langle |a| \rangle. \tag{17}$$

Thereby, the mean absolute eligibility value used with the readout, $N_p(\theta^*) A^* / N$, is
effectively the same as $\langle |a| \rangle$ in the baseline model.
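
The heuristic can be written down as a short Python sketch. It takes the eligibility values recorded at readout time during the precursor run and returns the threshold and the update constant, where the constant is chosen such that the number of threshold crossings times the constant, divided by the total number of readouts, equals the mean absolute eligibility. The function name and the plain list of recorded values are assumptions of this example, and the sketch requires that at least one recorded value exceeds the threshold.

```python
def calibrate_readout(final_traces):
    """Return (theta, update_const) from eligibility values recorded without learning."""
    n = len(final_traces)
    mean_abs = sum(abs(a) for a in final_traces) / n
    theta = mean_abs                      # threshold: mean absolute eligibility
    # Number of readouts whose magnitude exceeds the threshold.
    n_p = sum(1 for a in final_traces if abs(a) > theta)
    # Chosen so that n_p * update_const / n equals mean_abs.
    update_const = n * mean_abs / n_p
    return theta, update_const
```

For the four recorded values 1.0, -3.0, 0.5 and 3.5, the mean absolute value is 2.0, two readouts exceed it, and the update constant becomes 4.0.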

Analog drift        The local accumulation units in the hardware synapses do not have a
mechanism for controlled decay of the eligibility trace. An ideal implementation of the circuit
would stay unchanged over time, after a spike-pair has caused an update. In reality there are
leakage currents causing the accumulation traces $a_+$ and $a_-$ and their difference $a$ to drift.
Leakage is caused by a number of processes that depend on transistor geometry, manufacturing
process, temperature and internal voltages (Roy et al., 2003). It is therefore difficult to predict
either time-scale, shape or variability of this effect. We try to get an estimate on the
sensitivity of the model to uncontrolled temporal drift, by simulating learning with a drift
function $\varphi_i(t; a_0)$. Here $t$ is the duration of the drift and $a_0$ is the starting value
for $t = 0$. The index $i$ is over all synapses and both trace polarities. This function describes
the development of $a_+(t)$ and $a_-(t)$ between spike-pair induced updates. The accumulation
value is given as the difference $a(t) = a_+(t) - a_-(t)$. We define an exponential drift function

$$\varphi_i(t; a_0) = \begin{cases} a_0 \exp(-\lambda_i t) & \text{for } \lambda_i > 0 \\ a_{\mathrm{max}} - (a_{\mathrm{max}} - a_0) \exp(\lambda_i t) & \text{for } \lambda_i < 0 \\ a_0 & \text{else,} \end{cases} \tag{18}$$

where $a_{\mathrm{max}}$ is the maximum value that $a_+$ and $a_-$ can assume and
$\lambda_i = 1/\tau_{e,i}$ is the inverse time constant. Positive $\lambda_i$ leads to exponential
decay as it was used so far. Negative $\lambda_i$ causes a drift away from zero, towards the limit
$a_{\mathrm{max}}$. For every synapse and for positive and negative traces, $\tau_{e,i}$ is drawn
from a Gaussian distribution with mean $\tau_e$ and standard deviation $m_e \tau_e$ using the
mismatch factor $m_e$. In the limit of large $t$, this allows for four final states of $a(t)$:
decay to zero, drift to $a_{\mathrm{max}}$ or $-a_{\mathrm{max}}$, and remaining constant at $a_0$
for $\lambda_i = 0$.

It is important to note that we do not intend to precisely model the leakage behavior of the 
analog circuit. Instead, we use a simple model capturing the essence of drifting analog values to 
get an estimate for the sensitivity to this effect. 
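
A drift function of this kind, together with the mismatched time constants, could be sketched as follows. The function names and numerical values are hypothetical, and the explicit form of the branch for negative inverse time constants (a saturating approach to $a_{\max}$) follows the verbal description and should be read as an assumption of this sketch.

```python
import numpy as np

def drift(t, a0, lam, a_max):
    """Value of one accumulation trace after a time t without updates.

    lam > 0:  exponential decay towards zero.
    lam < 0:  drift towards a_max (saturating form assumed in this sketch).
    lam == 0: the trace keeps its starting value a0.
    """
    if lam > 0:
        return a0 * np.exp(-lam * t)
    if lam < 0:
        return a_max - (a_max - a0) * np.exp(lam * t)
    return a0

def draw_inverse_time_constants(n, tau_e, m_e, rng):
    """Draw tau_i ~ N(tau_e, (m_e * tau_e)^2) and return lambda_i = 1 / tau_i."""
    tau = rng.normal(loc=tau_e, scale=m_e * tau_e, size=n)
    return 1.0 / tau

rng = np.random.default_rng(0)
lams = draw_inverse_time_constants(4, tau_e=0.5, m_e=0.5, rng=rng)

# Hypothetical example: the positive trace decays, the negative trace drifts.
a_plus = drift(1.0, a0=0.8, lam=2.0, a_max=1.0)
a_minus = drift(1.0, a0=0.2, lam=-2.0, a_max=1.0)
a = a_plus - a_minus
```

Because the time constants are drawn from a Gaussian, a sufficiently large mismatch factor produces negative values, which is how drift away from zero enters the simulation.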

Delayed reward. The hardware system is a physical model of the emulated network.
Therefore, emulated time progresses continuously during network operation with the acceleration
factor $\alpha$ relative to wall-clock time. During all communication and computation, network
operation continues. The amount of reward for each trial is calculated by the control cluster, after 
the nominal trial duration has ended and output spike events have been transmitted to the clus- 
ter. The success signal is then determined and sent back to the embedded processor. Then, the 
plasticity program will sequentially execute the weight update for all synapses taking a certain 
amount of time per synapse. This time is consumed by the synapse array access and the weight 
computation. 

These two effects are modeled by adding a constant delay $D_R$ after the trial has finished and an
update rate $\nu_s$ giving the number of updated synapses per second. The weight update for synapse
$i$ occurs at $t_i = t_{\text{trial}} + D_R + i/\nu_s$. The order in which synapses are updated is determined by their
position in the synapse array and is therefore a result of the automated mapping process. For
this study, we assume weight updates to be fast compared to the reward delay $D_R$ and
therefore use $t_i = t_{\text{trial}} + D_R$.
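
As a small illustration of the sequential update schedule, the following sketch computes the emulated update time of each synapse. It assumes that synapses are indexed from zero in update order and that each update takes $1/\nu_s$; the function name and the numbers are hypothetical.

```python
import numpy as np

def update_times(n_synapses, t_trial, d_r, nu_s):
    """Emulated time at which synapse i = 0..n-1 receives its weight update."""
    i = np.arange(n_synapses)
    return t_trial + d_r + i / nu_s

# Hypothetical values: 1 s trial, 0.5 s reward delay, 100 updates per second.
times = update_times(n_synapses=4, t_trial=1.0, d_r=0.5, nu_s=100.0)
```

For a large update rate the last term becomes negligible, which corresponds to the simplification used in this study.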

The delay causes a deviation from the ideal model because the accumulation capacitors $a_+$,
$a_-$ used to store the eligibility trace continue to decay. The eligibility value used for the weight
update is then reduced by a factor

$$\beta = \exp\left(-\frac{D_R}{\tau_e}\right) \qquad (19)$$

This can prevent a weight update that would have been made in the non-delayed case by reducing
$a$ below the readout threshold $\theta$. We assume that the delay $D_R$ is known or can be estimated and
lower the threshold to $\beta\theta$.

In theory, this would allow correcting for arbitrary delay, since the exponential decay never
reaches zero. In hardware this is not the case, because the eligibility readout is subject to noise.
Therefore, after a certain delay, traces will be indiscernible from noise. To account for this, we
simulate Gaussian distributed noise $\delta a$ on the readout with standard deviation $\sigma_a$ and mean 0.
The value used for comparison to the threshold is then given by $a' = a + \delta a$. If a signal-to-noise
ratio $z^*$ is required for correct learning, a limit $D_{\max}$ for the delay can be calculated using the
signal-to-noise ratio

$$z(t) = \frac{a(t)}{\sigma_a} \qquad (20)$$

With $z(D_{\max} + t_{\text{trial}}) = z^*$ and $a(t_{\text{trial}}) = a_{\max}$, the maximally tolerable delay in the presence of
noise is given by

$$D_{\max} = -\tau_e \ln\left(\frac{z^* \sigma_a}{a_{\max}}\right) \qquad (21)$$
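
The delay correction and the resulting limit can be sketched in plain Python. The helper names are hypothetical, eligibility is expressed in units of $a_{\max}$, and the use of the absolute value in the threshold comparison (to cover both polarities) is an assumption of this sketch rather than a detail taken from the hardware.

```python
import math
import random

def decay_factor(d_r, tau_e):
    """Factor by which the eligibility trace shrinks during the delay d_r."""
    return math.exp(-d_r / tau_e)

def max_delay(tau_e, z_star, sigma_a, a_max):
    """Largest delay for which a trace starting at a_max keeps an SNR of z_star."""
    return -tau_e * math.log(z_star * sigma_a / a_max)

def noisy_readout_exceeds(a_trial, theta, d_r, tau_e, sigma_a, rng):
    """Compare a delayed, noisy readout with the delay-corrected threshold."""
    beta = decay_factor(d_r, tau_e)
    a_noisy = beta * a_trial + rng.gauss(0.0, sigma_a)
    return abs(a_noisy) > beta * theta

tau_e, a_max = 0.5, 1.0   # seconds of emulated time; eligibility in units of a_max
d_max = max_delay(tau_e, z_star=1.0, sigma_a=0.01, a_max=a_max)
```

At a delay of `d_max`, the decayed trace `decay_factor(d_max, tau_e) * a_max` equals `z_star * sigma_a`, so longer delays push even the largest possible trace below the required signal-to-noise ratio.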



2.2.5. Measuring performance 

Simulations consist of 10000 trials in 20 parallel runs with different random seeds. At the
beginning of every run, 100 trials are simulated without learning: during this time the running average
R can settle to a stable approximation of the reward. The average over R during these trials is
used as the initial reward level $R_{\text{before}}$ of this run. During the last 1000 trials of the simulation, it
is assumed that learning has reached a stable state: the final reward level $R_{\text{after}}$ is the average of
R over these trials.
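
The two reward levels can be computed from an array of reward traces as sketched below. The array layout (runs by trials), the function name and the synthetic data are assumptions made for illustration; the definition of the running average itself is not reproduced here, so the trace is taken as given.

```python
import numpy as np

def reward_levels(r, n_before=100, n_after=1000):
    """Per-run initial and final reward levels for r of shape (runs, trials)."""
    r = np.asarray(r, dtype=float)
    r_before = r[:, :n_before].mean(axis=1)   # trials simulated without learning
    r_after = r[:, -n_after:].mean(axis=1)    # trials after learning has settled
    return r_before, r_after

# Synthetic placeholder data: 20 runs x 10000 trials rising from 0.2 to 0.5.
rng = np.random.default_rng(1)
trace = np.linspace(0.2, 0.5, 10000) + 0.01 * rng.standard_normal((20, 10000))
r_before, r_after = reward_levels(trace)
improvement = r_after - r_before
summary = (r_after.mean(), r_after.std())     # mean and spread across runs
```

Reporting the spread across runs as a standard deviation is one plausible reading of the "±" values quoted in the results, not a confirmed detail.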

The model is simulated using the Brian simulator (Goodman and Brette, 2008). Weight up- 
dates are calculated with custom Python code using the NumPy package (Numpy, 2012). 



3. Results

In the previous section, we analyzed a synaptic learning rule (Izhikevich, 2007; Florian, 2007;
Fremaux et al., 2010), and the necessary adjustments that have to be made in order to implement
it on a hardware system. The goal of this section is to quantify the sensitivity to constraints of the
system - for example discretized weights or imperfections of analog circuits - to identify those
critical for the model. Starting from the baseline configuration without hardware effects, we add
constraints and measure their effect on the learning performance.

3.1. Baseline

The baseline model implements the learning rule described in Section 2.2 and Table 1 without
hardware effects, and serves as comparison for simulations including such effects. The eligibility
trace e of the theoretical model is identified with the local accumulation a in hardware synapses.
Thereby, changes to the weight are deferred until the success signal S is given from the attached
control cluster, after the produced spike train has been evaluated. New weights are assumed to
be calculated using a software program running on the EPP.

The raster plot in Figure 3 shows the output spike train at several points in time during a
learning simulation. In the beginning at trial 0, spikes are generated randomly by the background
stimulation. Later on, the network learns to produce spikes at the targeted points of time indicated
with red vertical bars. In the last trial, neurons fire close to most of the target times. The evolution
of the reward obtained in each trial averaged over 20 runs is shown in Figure 5A. Variance in the
last 1000 trials is due to the random background stimulation and to the exploratory behavior it
generates in the learning rule. Most of the performance improvement is achieved within the first
2000 trials, the final level of reward being $R^{b}_{\text{after}} = 0.54 \pm 0.05$.

Figure 3: Raster plot of output spike events for all five neurons at intervals of 2000 trials. Red
bars indicate the target firing times.

This is the result using one particular set of reference weights $W_{ij}$ and stimulation pattern $S_{ij}$
that were defined in Section 2.2.1. To test how well this result generalizes to other weights and
stimulation patterns, we perform two additional experiments. In the first, we randomize the
reference weights, so that in 20 simulation runs the network learns with a different set of reference
weights in each run. These weights are drawn randomly from a uniform distribution, so that the
k-th run uses reference weights drawn from $U(w_{\min}, w_{\max})$ to generate its target spike train. This gives
a final level of reward of $R_{\text{after}} = 0.59 \pm 0.08$ averaged over the 20 runs with different reference
weights.

In the second experiment we again use the original reference weights for all 20 simulations. The
stimulation pattern is randomized by drawing new spike times for each run from a uniform
distribution, so that the k-th run uses spike times drawn from $U(0, t_{\text{trial}})$ for all trials. This gives a
performance of $R_{\text{after}} = 0.53 \pm 0.08$ averaged over the 20 different sets of stimulation patterns.

The final reward levels for the baseline simulation, randomized reference weights and randomized
stimulation patterns are shown in Figure 4. The data show that the special case of reference
weights and stimulation spike times $S_{ij}$ used from here on lies within the performance range of
randomly selected reference weights and input spike timings. The variances of the final reward in
the two randomized experiments also show that there is considerable variation in the unconstrained
theoretical model. To reduce variation in our results, so that changes caused by hardware effects
are more visible, we use the fixed reference weights and stimulation pattern $S_{ij}$ from here on.

3.2. Discretized weights 

In designing the neuromorphic hardware system, one is faced with a trade-off between implementing
more synapses with lower bit resolution and fewer synapses with higher resolution. Therefore,
we would like to know how many bits are required for each synaptic weight to achieve good
performance in the learning task. We perform a three-way comparison between the baseline
model, a deterministic algorithm that simply rounds calculated weights to allowed representations,
and a probabilistic variant as outlined in Section 2.2.4. Using deterministic weight updates,
updates satisfying Equation 11 do not cause a weight change. With fewer bits more updates
are lost and learning performance is expected to suffer. This is what can be seen in Figure 5. The
simulations shown there compare the performance of the baseline model to a constrained model
with discretized weights of decreasing resolution. Figure 5A also shows the full reward trace of
a single run picked arbitrarily. The plot exhibits a number of sharp drops in reward that last only
for a few trials before returning to the previous performance level. The final level of performance
is not affected by these glitches. For the 8 bit case, performance is as good as using continuous
weights (Figure 5B). Figure 5C shows a slightly reduced performance for 6 bit. Using only 4 bit
with deterministic updates causes performance to degrade: it does not reach the same final level
of reward (Figure 5D, black trace). Using probabilistic updates improves the performance for
4 bit close to the baseline level (Figure 5D, green trace).
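
The difference between the two update schemes can be illustrated with a generic stochastic-rounding sketch. This is not necessarily the exact scheme of Section 2.2.4: the rounding rule (round up with probability equal to the fractional position between two grid points), the weight range [0, 1] and all names are assumptions made for illustration.

```python
import numpy as np

def discretize_deterministic(w, step):
    """Round each weight to the nearest multiple of `step`."""
    return np.round(np.asarray(w, dtype=float) / step) * step

def discretize_probabilistic(w, step, rng):
    """Round up with probability equal to the fractional position on the grid."""
    scaled = np.asarray(w, dtype=float) / step
    lower = np.floor(scaled)
    round_up = rng.random(scaled.shape) < (scaled - lower)
    return (lower + round_up) * step

rng = np.random.default_rng(42)
step = 1.0 / 15        # 4 bit: 16 levels on the assumed range [0, 1]
w_old = 5 * step       # weight sitting exactly on the grid
dw = 0.2 * step        # update smaller than half a grid step

# Deterministic rounding discards the small update entirely.
det = discretize_deterministic(w_old + dw, step)

# Probabilistic rounding applies a full step in about 20 % of the cases,
# so the update is preserved on average.
prob = discretize_probabilistic(np.full(100_000, w_old + dw), step, rng)
```

In this sketch `det` stays at `w_old`, while the mean of `prob` is close to `w_old + dw`, which is the mechanism by which probabilistic updates can recover small weight changes at low resolution.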

So in the task studied here, there is no gain in building synapses using more than 8 bit. Because
weight updates are controlled by a programmable processor, it is possible to switch between
deterministic and probabilistic updating even after the system has been manufactured. In this
context, a trade-off can be made between number of synapses and reachable performance by
using either probabilistic 4 bit or deterministic 8 bit synapses.

Figure 5: Reward traces showing the running average R (only every 50th point plotted) for
different weight resolutions averaged over 20 runs. (A) Baseline performance with
continuous weights. Additionally, the light gray trace shows the reward R for every trial
of a single simulation. (B) Performance with 8 bit resolution. The lower plot shows
the difference to the baseline model in (A). The shaded area shows the difference for
every point in the trace instead of only for every 50th. (C) Performance with 6 bit
resolution. (D) Performance with 4 bit resolution. The black trace shows the result for
deterministic updates, the green trace for probabilistic updates.

Figure 6: Performance with threshold readout. As in Figure 5 the running average of the reward
R is plotted averaged over 20 runs. The lower plots show the difference to the
baseline trace in Figure 5A. (A) Performance traces for continuous and 8 bit weights. In
gray, the reward R for every trial in a single run with continuous weights is shown. (B)
Performance traces for 4 bit resolution with deterministic and probabilistic updates.

3.3. Thresholded readout 

The hybrid approach of combining processor-based digital computing with analog special-function
units necessitates an interface between the two. At this interface some form of
analog-to-digital conversion (ADC) has to take place. The simplest form of ADC is comparison to
a threshold. We next ask whether such a simple interface is sufficient for good performance
on the learning task. Figure 6 shows performance for different weight resolutions compared to
baseline using the thresholded readout. In contrast to the simulations shown in Figure 5, updates
are now calculated according to Equation 16 instead of Equation 8. In particular, Equation 16
does not directly use the eligibility trace $e(t_{\text{trial}})$, but the evaluation bits $b_+$, $b_-$ determined by the
readout mechanism (Equation 15). Performance in the case of continuous, 8 and 6 bit synapses
(6 bit with threshold readout mechanism not shown) is nearly identical for both cases. When
comparing traces for weights of the same resolution in Figures 5 and 6, those with threshold
readout (Figure 6) show less variability between trials. For example, the trace of the single run in
Figure 5A exhibits more noise than the one in Figure 6A. This is caused by the smoothing effect
of the readout threshold, which effectively replaces extreme values of the eligibility trace $e(t_{\text{trial}})$
with the update constant $A = A^*$. The update constant $A^*$ is determined heuristically according
to Equation 17. The data show a small improvement in final performance over baseline for
continuous and 8 bit weights when using the readout threshold. This can be explained by the
reduced noise on the reward trace, which allows for a closer approximation of optimal weights. When
using probabilistic updates (Figure 6B, green trace), 4 bit are enough to come close to baseline
performance. With deterministic updates and 4 bit synapses, performance is even worse using
threshold readout than without (black traces in Figures 5D and 6B).

Figure 7: Difference of final reward to the baseline simulation, $R_{\text{after}} - R^{b}_{\text{after}}$, in units of the baseline
standard deviation. The varied parameters are the average time constant and the amount
of mismatch between synapses.

Hence the simple readout method, which uses only a threshold comparison, does not
reduce performance. Therefore, the qualitative result from the previous section still holds: with
deterministic updates 6 bit is enough for satisfactory performance. If updates are performed in a
probabilistic manner, 4 bit is sufficient.

3.4. Analog drift 

In the hardware system, the eligibility trace is implemented as an analog variable inside the
synapse circuit. It is therefore subject to drift caused by leakage currents. In Equation 18, we
have proposed to model this using a drift function. Additionally, this behavior varies between
synapses due to imperfections introduced by the manufacturing process. This is accounted
for by randomly drawing parameters for the drift function from a Gaussian distribution.

To assess the impact of this drift on the performance in the learning task, we performed a sweep
over a number of average time constants and degrees of mismatch between synapses. The results
of the simulation, using continuous weights and the thresholded eligibility readout described
above, are shown in Figure 7. The gray value indicates the difference between $R_{\text{after}}$ and the
baseline value $R^{b}_{\text{after}}$ (Section 3.1) in units of the standard deviation of the baseline simulation
(darker color is better). All values fall within one standard deviation of the baseline case, which
means that performance is only weakly sensitive to changes of time constant and mismatch of
the eligibility trace. The best performance is achieved for $\tau_e = 0.5$ s and no mismatch. This
is equivalent to the black trace in Figure 6A, which also shows slightly improved performance
compared to baseline. The improvement is explained by the smoothing effect of the threshold readout as
discussed in Section 3.3. For very large time constants, i.e. $\tau_e = \pm 1000$ s, drift is negligible
compared to the trial duration $t_{\text{trial}} = 1$ s. This leads to minor deviations in the leftmost and
rightmost columns of Figure 7. The worst performance is obtained for small time constants $\tau_e$
with large mismatch factor $m_e$, because for $\tau_e$ less than or equal to the trial duration, the effect
of drift is more important.

Figure 8: Improvement in reward $R_{\text{after}} - R_{\text{before}}$ by learning for a range of delays and
accumulator readout noise levels. Red bars indicate the predicted maximally tolerable delay
(Equation 21). Data is averaged over 15 simulation runs.

In this test, the model has been shown to be robust to large deviations from the temporal behavior
of the eligibility trace in the baseline model. Drift towards the positive and negative extrema
of the eligibility trace, which is the opposite of the desired decaying behavior, does not affect
performance. Neither does variation of up to 150 % of the time constant. This shows the model to
be a well-suited candidate for implementation in neuromorphic hardware, where large variations
and distortions are often encountered.

3.5. Delayed reward 

In the proposed system, the simulation of the neural network is carried out by analog hardware
elements, while the simulation of the environment is left to a conventional computer system.
In this context, latencies due to technical reasons - e.g., communication with the environment
or computation by the EPP - can cause temporal delays with respect to ideal calculations.
Additionally, the analog readout of the accumulation traces $a_+$, $a_-$ is affected by noise.

To better understand the impact of these effects on learning performance, a sweep over readout
noise and reward latency values was performed, the results of which are shown in Figure 8.
The simulation did not include mismatched drift, but used a fixed time constant of 500 ms with
continuous weights. The gray value represents the improvement in reward by learning,
$R_{\text{after}} - R_{\text{before}}$. The data show that, depending on the amount of noise, learning is impaired by the delay.
The red bars indicate the predicted maximally tolerable delay assuming a signal-to-noise ratio
of one is required (Equation 21). The simulation fits the prediction well. A noise level of
$\sigma_a = 500$ pS corresponds to 50 % of the maximum of the eligibility trace $a_{\max}$.

The simulation results confirm that noise on the local accumulation circuit limits the tolerable
delay. Because of the accelerated time base of the system, communication delays can easily reach
seconds of emulated time. With an acceleration factor of $\alpha = 10^5$, one second of emulated time
is equivalent to 10 μs. So with 1 % of noise ($\sigma_a = 10$ pS), the round-trip time to the environment
must be less than 20 μs for a time constant of $\tau_e = 500$ ms. Equation 21 can be used to find working
combinations of the parameters round-trip time, analog noise and time constant.
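
The arithmetic behind these numbers can be reproduced with a short script. It is a sketch under the assumptions stated in the text (required signal-to-noise ratio of one, noise given as a fraction of $a_{\max}$); the function name is hypothetical.

```python
import math

def max_round_trip_wallclock(tau_e, noise_fraction, alpha, z_star=1.0):
    """Tolerable round-trip time in wall-clock seconds.

    tau_e:          decay time constant in emulated seconds
    noise_fraction: sigma_a / a_max
    alpha:          acceleration factor (emulated time per wall-clock time)
    """
    d_max_emulated = -tau_e * math.log(z_star * noise_fraction)
    return d_max_emulated / alpha

for frac in (0.01, 0.1, 0.5):
    t = max_round_trip_wallclock(tau_e=0.5, noise_fraction=frac, alpha=1e5)
    print(f"sigma_a/a_max = {frac:4.2f}: {t * 1e6:5.1f} us")
```

For 1 % noise this gives about 2.3 s of emulated time, i.e. roughly 23 μs of wall-clock time, in line with the bound of about 20 μs quoted above; larger noise levels shrink the budget further.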

4. Discussion 

In this study we have proposed a hybrid architecture for plasticity, combining local analog com- 
puting with global, program-based processing. We have then simulated a reward-modulated 
spike-timing-dependent plasticity learning rule studied by Fremaux et al. (2010) to analyze its 
implementability. Starting from a baseline case with no hardware effects, the level of hardware 
detail of the simulations was increased, with a focus on the negative effects introduced by an 
implementation using the proposed system. Note that we did not try to precisely model the hard- 
ware device, as it would be done, for example, in a transistor level simulation. Instead, our goal 
was to find the effects to which the model is sensitive in order to guide future design decisions. 

Overall, we did not find major obstacles for the proposed implementation, but we showed that 
some design choices are critical to the proper functioning of the learning rule. In the following, 
we will discuss guidelines concerning weight resolution, implementation of the eligibility trace 
and the importance of low-latency communication. After that, we will compare the design with 
other hardware systems and discuss the limitations of this study. 

4.1. Weight resolution 

For neuromorphic hardware systems using digitally represented weights, a key question is how 
many bits to use per synapse, as this determines the amount of wafer area the circuit requires. 
For networks with highly connected neurons, small synapses are important for scalability. This 
drives implementations to a reduction of the number of bits used for the weight compared to 
software simulators, which typically use a quasi-continuous 32 or 64 bit floating-point represen- 
tation. On the other hand, on-line synaptic plasticity learning rules, for example STDP, require 
incremental changes to the weights. Discretization confines these changes to a grid with a reso- 
lution determined by the number of bits. 

For the synaptic plasticity model and the learning task considered, we found that this indeed 
limits learning performance when using deterministic updates and 4 bit weights. Two solutions 
to this problem were tested: using higher resolutions and making updates probabilistically. In 
the former case, a performance comparable to the continuous case is reached with 6 bit. With 
probabilistic updates, the performance of 4 bit synapses could be improved to nearly the same 
level. Therefore, it is not necessary to build high resolution hardware synapses comparable to 
software simulators, but even a modest number of bits gives good performance. 

In Seo et al. (2011) the authors arrive at a similar result. They built a completely digital
system in one version with 1 bit synapses and probabilistic updates and another with 4 bit synapses and
deterministic updates. Learning performance in a benchmark task is improved in the latter case,
but at additional cost in area and power consumption.

In Pfeil et al. (2012) the question of weight resolution was also studied for the BrainScaleS
wafer-scale system using a synchrony detection task. Comparable to our findings, they report
8 bit weights to perform as well as floating-point weights. 4 bit weights were sufficient for
solving the task, but did not reach the same performance.

4.2. Implementation of the eligibility trace 

In neural models of reinforcement learning, the eligibility trace serves an important purpose: it
allows neural activity to be connected with reward. Reward typically arrives with a delay relative
to the neural activity, i.e. the spikes, underlying the actions that caused it. But only when reward arrives does
the agent know how to change the weights. The hybrid concept of local analog accumulation
and global processor-based weight computation fits this model very well. Therefore, we can
identify the local circuit in the synapse with the eligibility trace. However, there are two
differences. First, the processor does not have direct access to the accumulated value, but can only do
a simple comparison operation (Equation 2). Second, there is no controlled exponential decay of
the accumulator. The analysis in Sections 3.3 and 3.4 shows no degradation in learning
performance due to either effect. On the other hand, the lack of controlled and possibly configurable decay
constrains the fidelity with which learning rules can be implemented. It is not clear
how other learning tasks would be affected by this lack.

4.3. Impact of real-world timings 

In the presence of delayed reward, three parameters govern whether learning is possible: 1) com- 
munication round-trip-time to the environment and back, 2) the amount of noise on the eligibility 
trace, and 3) the time constant of decay of the eligibility trace. Equation 21 can be used to determine
working combinations of them. Reducing the speed-up factor would make communication latency
less of a problem, but it would require longer lasting analog storage to achieve the same
time constant in emulated time. Small long-term analog memory is difficult to build due to leak- 
age effects. Therefore, the triangle of parameters needs to be carefully balanced. A different 
approach to deal with communication latency would be to execute the environment on the EPP 
itself. This would require adding direct access to spike times by the processor. 

4.4. Comparison to other STDP implementations 

Plasticity implementations found in the literature typically focus on variants of unsupervised 
STDP and use fixed-function hardware. For example in Indiveri et al. (2006) STDP works on 
bi-stable synapses and is implemented using fully analog circuits. In Ramakrishnan et al. (2011) 
analog floating-gate memory is used for weight storage that can be subjected to plasticity. In 
contrast, Seo et al. (2011) describes a fully digital implementation using counters and linear- 
feedback shift registers for probabilistic STDP with single-bit synapses. All three systems per- 
form weight updates immediately for individual spike pairs with a fixed algorithm. This rules out 
the ability to implement an eligibility trace to solve the distal reward problem in reinforcement 
learning (Izhikevich, 2007). 

However, there are systems that also use a general-purpose processor for plasticity. For
example, in Vogelstein et al. (2003) an implementation of STDP in an address-event representation
(AER) routing system is presented. They use three individual chips: a custom integrate-and-fire
neuron array, an SRAM based look-up table for synaptic connections and a micro-controller for
plasticity. For STDP, the micro-controller processes every spike and maintains queues of pre-
and post-synaptic events. This necessitates multiple off-chip memory accesses for every event
and at regular time steps. Contrary to our approach, their system has access to the detailed tim- 
ing of spikes and can therefore additionally implement rules including short-term effects, as in 
Froemke et al. (2010). However, in terms of scalability, our proposed system is superior due 
to the integration of processor, event routing and neuronal dynamics onto the same wafer. This 
reduces power consumption by eliminating communication across chip boundaries. Also, due to 
the hybrid architecture of analog accumulation and digital weight computation, the workload for 
the processor is reduced. This is an important aspect if a high speed-up factor is aimed for. 

The system reported in Davies et al. (2012) is a specialized multi-processor platform for neural
simulations. In implementing STDP, a key constraint for them is limited access to weights stored
in external memory. They solve this problem by predicting firing times based on the membrane
potential. This simultaneously illustrates the strength and weakness of this architecture. Since
the system is completely digital, they have unconstrained access to state variables, such as the
membrane potential. With analog neurons, this always requires some form of analog-to-digital
conversion. On the other hand, weights are stored external to the processor and have to be
transferred between chips. In our system, close integration of weight memory and processor on
the same substrate, in addition to the optimized input/output instructions of the SYNAPSE
special-function unit, makes weight access more efficient.

In conclusion, the hybrid processor based architecture proposed in this study represents a 
novel plasticity implementation for hardware. To our knowledge, it introduces two novel con- 
cepts: first, the integration of a general-purpose processor for plasticity onto the neuromorphic 
substrate, and second, the close interaction with specialized analog computational units using 
an extension of the instruction set. In combination, this allows for reward-based spike-timing- 
dependent synaptic plasticity in reinforcement learning tasks. 

4.5. Limitations 

The goal of this study was to analyze the implementability of a reinforcement learning task on
a proposed novel hardware system. The technical implementability of the system itself was
not the subject of this study. We assumed a sufficiently fast processor for the delay analysis
(Section 2.2.4). It should be part of the design process of a future implementation to test performance
against our simulations. The updating speed could limit the number of plastic synapses per
processor depending on the decay time constant $\tau_e$. We also did not model the analog part of the
system in detail, but restricted simulations to a generic drift function. Measurements in the
existing BrainScaleS wafer-scale system could be used to characterize the drifting behavior. However,
considering that we did not see degraded performance over a large range of time constants and
fixed-pattern variation, it does not seem likely that performance would be worse in a more
accurate model.

With regard to the model tested here, we restricted the study to one specific task of spike train
learning, which is a generic and general learning task for spiking neurons: many tasks can be
formulated as a relaxed version of spike train learning. We showed that the performance of the
model is not negatively affected by hardware constraints. It remains an open question whether
there are other tasks that give good performance in software simulations, but fail when hardware
constraints are included. We restricted the study to epochal learning with a defined trial duration
ended by the application of the reward. In a next step, this approach should be extended to
continuous time learning scenarios. In this case, processor update speed and the size of the
decay time constant could play a more important role.